Abstract: This paper addresses the problem of pedestrian
tracking using a monocular, potentially moving, uncalibrated
camera. Pedestrians are located in each frame using a standard
human detector and are then tracked in subsequent frames. This
is a challenging problem, as one has to deal with complex
situations such as changing background, partial or full
occlusion and camera motion. Successful tracking requires
resolving the associations between the windows detected in the
current frame and those obtained from the previous frame. In
contrast to methods that use temporal windows incorporating
past as well as future information, we attempt to make
decisions on a frame-by-frame basis. An occlusion reasoning
scheme is proposed to resolve the association problem between a
pair of consecutive frames: an affinity matrix defines the
closeness between each pair of windows, and binary integer
programming is then used to obtain unique associations between
them. A second stage of verification based on SURF matching
handles those cases where the optimization scheme might yield
wrong associations. The efficacy of the approach is
demonstrated through experiments on several standard pedestrian
datasets.

I. Introduction

In this paper, we look into the problem of tracking multiple
targets using a monocular, possibly moving, uncalibrated
camera. It has several applications in areas such as smart
vehicles, robotics and video surveillance, and can be used for
extracting higher-level information from a video, such as event
detection and crowd analysis. The task involves locating the
targets of interest, assigning a unique ID to each of them and
generating their trajectories. The problem is challenging, as
one has to deal with several complex situations such as
changing background, camera motion, wide variation in
appearance and illumination, and partial or full occlusion.

Tracking-by-detection has become one of the popular approaches
to solve this problem. In this framework, a detector is used to
locate targets in each frame, and these detections are then
associated across frames. This approach, however, suffers from
the limitations of the object detector, which may yield false
positives and missed detections. In addition, resolving
associations between detected targets across frames may become
challenging under conditions of group formation and
long-duration occlusion.

In most methods, the data association problem is solved by
optimizing the detection assignments over a temporal window
[?]. Nevatia's group, in particular, focused on hierarchical
association at multiple levels [?], where tracklets are
associated to form longer trajectories. The association is
formulated as a MAP problem which is solved using the Hungarian
algorithm. These were mostly off-line approaches in which the
frames were revisited over multiple iterations. In their latest
work [?], the authors learn a Conditional Random Field (CRF)
model of appearance and motion that also takes into account the
relative positions between the targets. There are other
approaches that make use of a particle filter to solve the
tracking problem, as in [?].

In this paper, we revisit the multi-target tracking problem
with a focus on simplifying the entire approach. We primarily
focus on using monocular camera images, in contrast to other
methods that use a stereo-vision system [?], laser scanner [?],
night vision [?] or LIDAR [?], sometimes in addition to vision.
We aim to make online decisions on a frame-by-frame basis,
unlike other approaches where a temporal window is used to
incorporate future information for resolving associations in
the current frame [?]. Frame-by-frame methods are prone to
frequent ID switches and trajectory fragmentation due to noisy
and ambiguous observations [?]. We attempt to overcome these
limitations of a frame-by-frame approach in this paper. We use
the standard HoG detector as used by the authors in [?], as we
are primarily making comparisons with their results. However,
any other object detector could be used for locating
pedestrians in the video. Readers may refer to [?] for a survey
of the state-of-the-art methods in pedestrian detection. Once a
new person is detected in a frame, a colour-based mean-shift
tracker is initialized. This mean-shift (MS) tracker [?],
combined with a Kalman Filter (KF) [?] based motion predictor,
is used to localize this target in the new frame. In order to
carry out successful tracking, it is necessary to resolve the
association between the currently detected target windows and
those estimated from the previous frame using the KF and MS
tracker. This is challenging, as the windows may overlap with
each other, resulting in many-to-one or one-to-many
associations. The need for an occlusion reasoning scheme in a
multi-agent tracking problem is illustrated in Figure 1. The
agent IDs get interchanged during occlusion when the
associations are not properly resolved; hence, there is a need
for an effective occlusion reasoning scheme.

Fig. 1: Need for Occlusion Reasoning Scheme (ORS) in
multi-agent tracking. Without ORS, agent IDs are interchanged
between the overlapping agents 1 and 2. This ID switch is
prevented using ORS.

Our main contribution lies in proposing an occlusion reasoning
scheme (ORS) that uses an affinity matrix and binary integer
programming to resolve the data association problem between a
pair of frames. The affinity matrix represents the ‘closeness’
between a pair of tracking windows. The binary integer
programming (BIP) module returns unique associations between
the agents of this pair. Since the resolution is based on a
feature-based scalar value function in the affinity matrix, the
resulting association might still be incorrect in some extreme
cases. We use agent pairing information from the last frame and
SURF matching to provide a second stage of verification over
the decision obtained from the BIP module. The resulting
algorithm is tested on several datasets and its performance is
compared with the state of the art.

It is to be noted that one can use the Hungarian algorithm [?]
in place of BIP for resolving associations. However, we
envisage that BIP or linear programming would allow us to
incorporate constraints which are not confined to the elements
of a matrix, as is the case with the Hungarian algorithm. One
such use case is demonstrated in [?]. In the present scenario,
however, the Hungarian algorithm is found to be computationally
more efficient than BIP in obtaining the same solution.

Even though this is an initial work with a lot of scope for
improvement, we believe that the material presented in this
paper provides useful insights for readers. The rest of this
paper is organized as follows. The proposed method is described
in Section II. The analysis of experimental results is provided
in Section III, followed by the conclusion in Section IV.

II. Proposed Approach 



Fig. 2: Outline of our approach for pedestrian tracking. The 
proposed occlusion reasoning scheme is shown in the red box. 


Each person detected or tracked is called an agent, which is
represented by a bounding box (BB) surrounding the person and
is labeled with a global ID. The agents for a given frame $I_k$
are represented by the symbol $A_k(i)$, $i = 1, 2, \dots, n$,
where $n$ is the number of agents found by the detector in the
frame.

The proposed method for carrying out pedestrian detection and
tracking is shown as a flowchart in Figure 2. The method
consists of four steps. The first step involves applying a
human detector to locate pedestrians in each frame. The second
step involves estimating the location of agents from the last
frame using a tracker that uses the mean-shift algorithm [?]
and a Kalman Filter [?]. It is necessary to associate these
agents from the last frame with those obtained from the
detector in the current frame. The association simply means
assigning appropriate labels or IDs to the currently detected
target windows. The problem becomes difficult when the agents
come together to form groups or undergo partial or full
occlusion. We propose an occlusion reasoning scheme to solve
this association problem between the past agents and the
currently detected target windows. This reasoning scheme is
explained next in this section. Once the associations are
resolved, the list of agents is updated by adding the new
agents found in the current frame. A list of the pairs of
agents which overlap with each other is also maintained, which
is essential for dealing with cases of occlusion.


In order to explain our approach, we use the following
notation. A given video sequence is represented by the symbol
$I_k$, $k = 1, 2, \dots, N$, indicating that the video has a
total of $N$ frames. As stated earlier, any standard human
detector is used to locate pedestrians in each frame.


A. Occlusion reasoning scheme 

The occlusion reasoning scheme includes four main steps. The
first step involves creating an affinity matrix between the
estimated agent windows obtained from the last frame (using the
tracker) and the persons detected in the current frame by the
detector. This matrix is utilized in the next step to resolve
the association between these two groups of agents using binary
integer programming. The resulting associations may contain a
few errors arising out of difficult cases like occlusion.
Hence, a second stage of verification based on SURF matching
and pairing information from the last frame is applied. Once
the associations with the previous agents are resolved, the
newly detected agents are given new agent IDs. The details of
each of these steps are provided below.


1) Affinity Matrix: An illustration of an affinity matrix for
a given frame is shown in Figure 3. Let us assume that the
number of agents found by the detector in the current frame
$I_k$ is $n$, while the number of agents obtained from the last
frame ($I_{k-1}$) is $m$. The location of these agents from the
last frame is estimated using the KF+MS tracker. These
estimated agents are represented by the symbol $\hat{A}_{k-1}$.
Hence the affinity matrix $S_k$ for this frame has dimension
$m \times n$, with each element having a value obtained from a
scalar function given by:

$$S_k(i,j) = f(O, BC) = \alpha_1 O(i,j) + \alpha_2 BC(i,j) \qquad (1)$$

where $O(i,j)$ is the percentage overlap between the two
bounding boxes, given by

$$O(i,j) = \frac{\hat{A}_{k-1}(i) \cap A_k(j)}{\hat{A}_{k-1}(i) \cup A_k(j)} \qquad (2)$$

and $BC(i,j)$ is the Bhattacharyya coefficient computed between
the corresponding bounding boxes, representing similarity based
on histogram matching. The weights $\alpha_1$ and $\alpha_2$
are normalized weights which are decided a priori, indicating
the relative importance of the individual factors in the
overall function. The values of the matrix elements $S_k(i,j)$
lie between 0 and 1, with 0 indicating no overlap or similarity
and 1 indicating a high level of affinity or similarity between
the windows. This affinity matrix indicates the ‘closeness’
between a pair of windows. It is used in the next step to
resolve associations between the agents obtained from the
previous frame and the persons detected in the current frame.
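
The computation of equations (1) and (2) can be sketched as follows. This is only a minimal illustration, not the implementation used in the paper: the function names, the `(x, y, w, h)` box layout, the plain-list histograms and the equal weights `alpha1 = alpha2 = 0.5` are all assumptions made for the example.

```python
import math

def overlap(box_a, box_b):
    """Overlap ratio of equation (2): intersection area / union area.
    Boxes are (x, y, w, h) tuples (an assumed layout)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def bhattacharyya(hist_a, hist_b):
    """Bhattacharyya coefficient of two histograms (normalized internally)."""
    sa, sb = sum(hist_a), sum(hist_b)
    if sa == 0 or sb == 0:
        return 0.0
    return sum(math.sqrt((p / sa) * (q / sb)) for p, q in zip(hist_a, hist_b))

def affinity_matrix(tracked, detected, alpha1=0.5, alpha2=0.5):
    """Build the m x n matrix S_k of equation (1).
    tracked / detected are lists of (box, histogram) pairs;
    the default weights are illustrative, not values from the paper."""
    return [[alpha1 * overlap(tb, db) + alpha2 * bhattacharyya(th, dh)
             for db, dh in detected]
            for tb, th in tracked]
```

With weights that sum to 1, each entry stays in the range [0, 1], as required for the matrix described above.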


2) Resolving associations using Binary Integer Programming
(BIP): The association of the currently detected target
windows with those obtained from the last frame is not
straightforward, for two reasons. First, the association
depends on multiple features, and the association obtained
using one feature might conflict with that obtained using
another. Second, there might be cases of one-to-many or
many-to-one associations between the two sets of windows. The
first problem is alleviated to some extent by forming the
affinity matrix, where multiple features or criteria are
combined into a single scalar function that indicates the
similarity or affinity between a pair of windows. The second
problem is solved by posing the association as an optimization
problem, which is solved using binary integer programming [?].
A binary decision variable is associated with each element of
the affinity matrix, and constraints are placed over the rows
and columns of the matrix so that many-to-one or one-to-many
associations do not occur. We use the COIN-OR CBC [?] library
to solve this problem. The parameters of the proposed BIP
formulation are as follows:


• $S_k(i,j) \in \mathbb{R}$ : coefficient of matching or
similarity between a given pair of windows in the affinity
matrix.

• $u_{ij} \in \{0, 1\}$ : decision variable; $u_{ij} = 1$ if the
matching pair $(i,j)$ is selected, else $u_{ij} = 0$.

Fig. 3: Affinity Matrix for a given frame. A non-zero value
indicates ‘closeness’ between a pair of windows. The values
are normalized between 0 and 1. (Axis label: agents found in
the current frame using the HoG detector.)

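The optimization problem is now stated as follows:

$$\arg\max_{u} \; \sum_{i=1}^{m} \sum_{j=1}^{n} S_k(i,j)\, u_{ij} \qquad (3a)$$

subject to

$$\sum_{i=1}^{m} u_{ij} \le 1 \quad \forall j \qquad (3b)$$

$$\sum_{j=1}^{n} u_{ij} \le 1 \quad \forall i \qquad (3c)$$

$$u_{ij} \in \{0, 1\} \quad \forall i, j \qquad (3d)$$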


The objective function (3a) aims at maximizing the total
affinity of the selected pairs of windows, as given by the
affinity matrix. The constraints (3b) and (3c) allow only
one-to-one associations between the considered pairs of
windows. The bound (3d) restricts the decision variables to be
binary. The decision variables having a value of 1 in the BIP
solution correspond to the selected pairs of windows. The
optimization process for resolving associations is illustrated
in Figure 4. We consider frame number 91 in the ETH2 dataset.
In this image, three detected target windows are labelled a, b
and c respectively. From the previous frame, 4 windows are
obtained using the KF+MS tracker. These windows have labels 3,
4, 5 and 6. The conflicting associations arise due to the pairs
shown in the green ellipse in the affinity matrix. The BIP
module gives rise to a binary matrix providing unique
associations between the two sets of windows. Hence, the target
window a is assigned the label 5, b is assigned 4 and c is
assigned label 3. The window with ID 6 is not associated with
any target window and hence appears as an ellipse in the final
image.
Fig. 4: Resolving associations using Binary Integer
Programming. The HoG detected windows (a,b,c) are associated
with the agents obtained from the tracker (3,4,5,6).
Conflicting associations are shown as green ellipse in the
affinity matrix. The output of BIP is a binary association
matrix providing final labels for the HoG windows. Agent 6 is
not associated with any HoG window.
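
To make the one-to-one assignment concrete, the sketch below solves the binary program of this section by exhaustive enumeration rather than with the COIN-OR CBC solver used in the paper. This is practical only for the handful of windows present in a frame. The function name, the rule that zero-affinity pairs are left unassigned, and the affinity values in the usage note are assumptions made for illustration; they are not taken from the paper.

```python
from itertools import permutations

def solve_assignment(S):
    """Return the (row, column) pairs with u_ij = 1 that maximize the total
    affinity while using every row and every column at most once.
    Brute-force stand-in for a BIP solver; S is a list of rows with
    non-negative entries."""
    m = len(S)
    n = len(S[0]) if m else 0
    if m == 0 or n == 0:
        return []
    transposed = m > n
    if transposed:                      # make rows the smaller side
        S = [list(col) for col in zip(*S)]
        m, n = n, m
    best_score, best_cols = -1.0, ()
    for cols in permutations(range(n), m):
        score = sum(S[i][j] for i, j in enumerate(cols))
        if score > best_score:
            best_score, best_cols = score, cols
    # Assumption: a pair with zero affinity is treated as "no association".
    pairs = [(i, j) for i, j in enumerate(best_cols) if S[i][j] > 0]
    if transposed:
        pairs = [(j, i) for i, j in pairs]
    return sorted(pairs)
```

For example, with rows standing for tracked agents 3, 4, 5, 6 and columns for detections a, b, c, the hypothetical matrix `[[0.0, 0.1, 0.8], [0.2, 0.7, 0.0], [0.9, 0.6, 0.0], [0.3, 0.0, 0.0]]` yields the pairs `(0, 2)`, `(1, 1)` and `(2, 0)`, that is c gets label 3, b gets label 4, a gets label 5, and agent 6 stays unassociated.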


3) Occlusion Handling through SURF Matching: We will use some
additional notation in order to explain the method presented in
this section. As stated above, each new agent is assigned a
unique global ID which is used to identify this agent wherever
it is visible in a video. We use the notation $L(A_k(i))$ to
denote the global ID of a given agent in the frame $I_k$. The
symbol $\{L(A_k)\}$ refers to the set of labels for all active
agents in this frame. We also define a set $G_k$ that consists
of all pairs of agent IDs that overlap with each other. In
other words,

$$G_k = \{g_k^i\}, \quad i = 1, 2, \dots, r \qquad (4)$$

where each element $g_k^i$ is a pair of agent labels (IDs)
given by

$$g_k^i = \{L(A_k(p)), L(A_k(q))\}, \quad p \neq q, \quad p, q \in \{0, 1, \dots, n\} \qquad (5)$$
As explained in the previous section, the global labels of the
active agents obtained from the detector in the current frame
are resolved by the binary integer programming (BIP) module,
which uniquely assigns the labels of agents from the last
frame, $L(A_{k-1})$, to the currently detected agents. Let us
denote this set of labels for the currently detected agents by
the symbol $\{\tilde{L}(A_k)\}$. Some of the labels obtained
from the BIP module might be erroneous, particularly for those
agents which get occluded or appear in groups in the current
frame. This is due to the fact that the decision of the BIP
module depends solely on the features used in the affinity
matrix. Even though multiple features or cues provide
robustness, they cannot guarantee correct decisions in all
cases.

Therefore, a second stage of verification is employed to
correct these labels by using the pairing information obtained
from the last frame, $G_{k-1}$, and SURF matching, as explained
in Algorithm 1. In this algorithm, $n(G_{k-1})$ refers to the
cardinality of the set $G_{k-1}$. The basic idea is that if one
of the agents in a pair disappears in the current frame, SURF
matching is used to recognize the agent which is available, and
it is assigned the corresponding agent label. The new set of
global IDs obtained for the currently active agents is denoted
by $\{L(A_k)\}$. Once the labels are found, a new set of agent
pairs is formed based on whether they overlap or not. This set
is denoted by $G_k$ and will be utilized in the next iteration.
The scheme is explained pictorially in Figure 5. Let us assume
that the agents (A, B) form a pair in the previous frame
$I_{k-1}$ and in the current frame only one window is detected
by the detector. Let us call it C. Also assume that the BIP
module associates window C with A. In this case, the SURF
matching between the pairs (A, C) and (B, C) is used to confirm
the final association. The one with the maximum percentage
match is selected as the correct association pair.


Algorithm 1 Occlusion Handling through SURF Matching

1: for $i = 0$ to $n(G_{k-1})$ do
2:   $g_{k-1}^i = \{L(A_{k-1}(p)), L(A_{k-1}(q))\}$
3:   if both the labels are in $\{\tilde{L}(A_k)\}$ then
4:     Do nothing
5:   else if both the labels are not present then
6:     Do nothing
7:   else {only one of the two labels, say, $p$ is present}
8:     Let $t$ be the index s.t. $\tilde{L}(A_k(t)) = L(A_{k-1}(p))$
9:     Compute SURF matching within the pairs
       $A_{k-1}(p) \leftrightarrow A_k(t)$ and $A_{k-1}(q) \leftrightarrow A_k(t)$
10:    New label to $A_k(t)$ is assigned as follows:
       $L(A_k(t)) = L(A_{k-1}(s^*))$, where
       $s^* = \arg\max_s \{A_{k-1}(s) \leftrightarrow A_k(t),\; s \in (p, q)\}$
11:   end if
12: end for
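
The control flow of Algorithm 1 can be sketched as below. This is an illustrative outline only: the SURF matcher is abstracted as a caller-supplied function `match_score(label, detection_index)` returning a match percentage, and that callable, the dictionary layout and the function name are all hypothetical. No SURF library is invoked here.

```python
def verify_labels(prev_pairs, tentative, match_score):
    """Second-stage verification of the labels produced by the BIP module.

    prev_pairs  : iterable of (label_p, label_q) pairs that overlapped in the
                  previous frame (the set G_{k-1})
    tentative   : dict mapping detection index -> label assigned by BIP
    match_score : hypothetical callable (label, detection_index) -> float,
                  e.g. the percentage of matched SURF keypoints
    Returns a new dict with the corrected labels.
    """
    final = dict(tentative)
    assigned = set(tentative.values())
    for p, q in prev_pairs:
        present = [label for label in (p, q) if label in assigned]
        if len(present) != 1:
            continue                    # both present or both absent: do nothing
        # Detection that currently carries the surviving label.
        t = next(idx for idx, label in tentative.items() if label == present[0])
        # Keep whichever member of the old pair matches the detection best.
        final[t] = max((p, q), key=lambda s: match_score(s, t))
    return final
```

In the (A, B), C example of the text, `tentative` would map C to A, and the label of C is switched to B only when the match score of (B, C) exceeds that of (A, C).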


Fig. 5: Occlusion Handling through SURF Matching. If one of
the agents from the previous pairs is not present in the
current frame, use SURF-matching to identify the available
agent.

B. Estimating agent location using Kalman Filter and
Mean-shift tracker

As stated earlier, the location of agents from the last frame
is estimated in the current frame using a mean-shift tracker
combined with a Kalman Filter. It is well known that the
detector may not provide a detection for a given agent in every
frame where it is present. In case of detection failure, a
Kalman Filter can be used to predict its location. The Kalman
Filter itself learns from the observations obtained from the
detector. The reliance on the object detector, which is
computationally expensive, can be reduced by using a mean-shift
tracker [?] that uses a colour histogram to locate the target
in the next frame. The mean-shift tracker is initialized for
each new agent obtained from the detector. In cases where the
detector fails to locate this agent, the Kalman Filter and the
mean-shift tracker can be used together to confirm the location
of the said agent. Moreover, the mean-shift tracker along with
a Kalman Filter can be used to reduce the computational cost by
reducing the search area for the detector.
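
The role of the Kalman Filter during a missed detection can be illustrated with a small sketch. The paper does not state the motion model, so a constant-velocity model on a single coordinate is assumed here, with one filter instance per bounding-box coordinate. The class name and the noise parameters `q` and `r` are likewise illustrative assumptions.

```python
class ConstantVelocityKF:
    """1-D constant-velocity Kalman filter (assumed model, one per coordinate).
    State is [position, velocity]; only the position is measured."""

    def __init__(self, x0, q=0.01, r=1.0):
        self.x, self.v = float(x0), 0.0        # position and velocity estimates
        self.p = [[1.0, 0.0], [0.0, 100.0]]    # covariance; velocity is unknown
        self.q, self.r = q, r                  # process / measurement noise

    def predict(self):
        """Advance one frame: x <- x + v and P <- F P F^T + Q."""
        (a, b), (c, d) = self.p
        self.x += self.v
        self.p = [[a + b + c + d + self.q, b + d],
                  [c + d, d + self.q]]
        return self.x

    def update(self, z):
        """Correct the predicted state with a measured position z."""
        (a, b), (c, d) = self.p
        s = a + self.r                         # innovation variance
        k0, k1 = a / s, c / s                  # Kalman gain
        y = z - self.x                         # innovation
        self.x += k0 * y
        self.v += k1 * y
        self.p = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
```

When the detector returns a window, `predict()` is followed by `update(z)`; when the detection is missing, only `predict()` is called and the agent location is extrapolated along its estimated velocity.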

III. Experimental Results 

The performance of our algorithm is evaluated on three
datasets, which are the same as those used by Yang and Nevatia
[?]: the TUD dataset [?], PETS 2009 [?] and the ETH dataset
[?]. Since we wanted to compare our results with those reported
in [?], we used the same detector and performance parameters to
compute our tracking results. The resulting comparison is
provided in Table I. The performance parameters used are:
precision, recall, false alarms per frame (FAF), ground truth
trajectories (GT), mostly tracked trajectories (MT), partially
tracked trajectories (PT), mostly lost trajectories (ML),
number of trajectory fragmentations (Frag) and number of ID
switches (IDS). Please refer to the above paper for definitions
of the various parameters mentioned in this table. A few such
parameters computed for one of the ETH datasets are shown in
Figure 7. In this figure, false trajectories (FT) are those
which are generated due to a false detection made by the HoG
detector. Some snapshots of various agent trajectories for
different datasets are shown in Figure 6. The complete tracking
video is made available on the web [?] for the convenience of
readers. The snapshots show some of the cases where our scheme
is able to resolve associations, resulting in accurate tracking
of the agent.
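
For readers unfamiliar with these measures, the sketch below shows one common way of computing them. The exact definitions are those of the paper referred to above; the 80% and 20% coverage thresholds for MT and ML used here are the convention commonly found in the tracking literature and are stated as an assumption, as are the function names and input layout.

```python
def detection_metrics(tp, fp, fn, num_frames):
    """Precision, recall and false alarms per frame (FAF) from counts of
    true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    faf = fp / num_frames if num_frames else 0.0
    return precision, recall, faf

def trajectory_breakdown(coverage, mt_thresh=0.8, ml_thresh=0.2):
    """Classify ground-truth trajectories by the fraction of their length
    that is tracked. Thresholds are assumed, not taken from this paper.
    Returns GT (a count) and MT, PT, ML as percentages of GT."""
    gt = len(coverage)
    if gt == 0:
        return {"GT": 0, "MT": 0.0, "PT": 0.0, "ML": 0.0}
    mt = sum(1 for c in coverage if c > mt_thresh)
    ml = sum(1 for c in coverage if c < ml_thresh)
    pt = gt - mt - ml
    return {"GT": gt, "MT": 100.0 * mt / gt,
            "PT": 100.0 * pt / gt, "ML": 100.0 * ml / gt}
```

By construction MT, PT and ML sum to 100%, which is also the case for every row of Table I.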

Table I shows that our approach yields lower precision and more
false alarms, trajectory fragmentations and ID switches than
Nevatia's latest work [?], even though it achieves higher
recall and better MT, PT and ML figures. It is to be noted that
Nevatia's work is based on tracklets, which introduces latency
into the decision-making process, unlike our approach where
decisions are taken on a per-frame basis. However, this is an
initial work which can be improved in several ways. Some of
them are as follows:

1) We have more trajectory fragmentations and ID switches
because we create a new ID for the same agent if it remains
occluded or undetected for a certain number of frames. One
approach would be to compare the currently detected targets not
only with the last frame but also with past trajectories.

2) We use a Kalman Filter as the motion predictor for each
agent. Its motion model is probably not valid in the case of
camera motion. The relative position of agents could be used as
a parameter for resolving associations between agents, as
suggested in [?].

3) The direction of each agent's motion, along with motion
coherence, can be incorporated into the affinity matrix.

4) The values of $\alpha_i$ in equation (1) are decided a
priori by the user. These could be treated as variables to be
optimized over another set of constraints.

5) Taking a cue from Nevatia's work [?], the relative location
of pedestrians could be utilized to compensate for camera
motion.


IV. Conclusion and Future Work

In this paper, we take a fresh look at the multi-target
tracking problem. Our main contribution lies in proposing an
occlusion reasoning scheme to solve the association among the
detected agents on a frame-by-frame basis. The scheme defines
an affinity matrix that captures the closeness between the
estimated agent windows of the previous frame and those
obtained from a detector in the current frame. This affinity
matrix is then used by a binary integer programming (BIP)
module to find unique associations between these pairs of
windows. A second stage of verification based on SURF matching
is employed to deal with the wrong associations generated by
the BIP module. This stage makes use of past agent pair
information to resolve the agent identities in the current
frame. The performance of our algorithm is compared with the
latest work in this field. It is still an initial work with a
lot of scope for improvement. The work presented here will be
useful for students and practicing engineers who would like to
understand the process and the underlying challenges of the
problem.









Method                  Recall (%)  Precision (%)  FAF    GT   MT (%)  PT (%)  ML (%)  Frag  IDS

ETH Dataset
Our approach            87.7        49.8           6.76   125  78.2    17.8    4.0     182   27
Yang & Nevatia (2014)   79.0        90.4           0.637  125  68.0    24.8    7.2     19    11

PETS 2009 Dataset
Our approach            97.3        68.0           2.66   19   94.7    5.3     0.0     64    8
Yang & Nevatia (2014)   93.0        95.3           0.268  19   89.5    10.5    0.0     13    0

TUD Dataset
Our approach            94.2        77.0           1.74   10   100     0.0     0.0     10    3
Yang & Nevatia (2014)   87.0        96.7           0.184  10   70.0    30.0    0.0     1     0

TABLE I: Comparison of tracking results for different datasets


Fig. 7: Tracking performance for ETH2 dataset (legend: MT, PT,
ML, FT, IS). False trajectories are generated due to wrong
detections by the HoG detector. Most agents (shown in green)
are correctly tracked with the help of the proposed occlusion
reasoning scheme.
ETH1 dataset: (1a-1e) Trajectory of agent 2 (lady on left with
blue window) intersects with that of agent 1 (person in the
centre) without any ID switch. (1f-1j) Agent 1 (brown) is
tracked successfully even when it forms a group with other
agents. However, an ID switch occurs with agent 6.

ETH2 dataset: (2a-2e) instances of occlusion.

ETH3 dataset: (3a-3e) Agents 1 and 3 are tracked successfully
in the presence of various false positives due to shadow and
reflection. (3f-3j) Two false trajectories are generated as a
result of detection failure (agents 6 and 11).

Fig. 6: Snapshots of trajectories generated for various agents
for three different ETH datasets. The predicted agent location
is shown as an ellipse. Each detected agent window is shown
with a rectangular bounding box with its agent ID. The figure
shows several instances where the ID switch is prevented and
the target is tracked successfully despite occlusion and other
effects.

































































































ETH4: (4a-4e) The lady on the right gets a new ID as she
recovers from occlusion. The sequence also shows several cases
of ID switch. (4f-4j) Agent 79 (navy blue) is tracked
successfully over a span of more than 100 frames and then
undergoes an ID switch. This video has significant camera
motion, which results in ... performance.

CAVIAR1: (5a-5e) Identity of agent 2 (shown in pink) is
restored once it recovers from occlusion by agent 1. (5f-5j)
Two ID switches occur among the three agents which move in a
group.

Fig. 6: Snapshots of tracking performance for video datasets
ETH4 and CAVIAR1. The CAVIAR dataset has a static background
while the ETH datasets have a dynamic background.








































